
Conversation

@kryanbeane
Contributor

@kryanbeane kryanbeane commented Aug 27, 2025

Issue link

RHOAIENG-30720

What changes have been made

Removed GCS FT from the lifecycled RayJob implementation. I will discuss this with the team, but my reasoning is as follows:

If GCS FT is enabled with a RayJob as the entrypoint and the GCS fails, the following happens:

  • The RayCluster is no longer in the RUNNING state, so the RayJob is set to FAILED by the KubeRay Operator
  • This tells the KubeRay Operator to delete the RayCluster

The only case where a RayJob will retry is if backoffLimit is set in the job spec. That field tells KubeRay how many times to spin up a new RayCluster for the job; it does not try to recover the same RayCluster, which makes GCS FT useless in this scenario.
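For illustration, a minimal sketch of a RayJob manifest with a retry budget, assuming the upstream KubeRay v1 API (`backoffLimit` and `rayClusterSpec` are KubeRay fields; the image and script path are placeholders, not values from this PR):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob
spec:
  entrypoint: python /home/ray/scripts/job.py   # placeholder script
  # Retry budget: each retry provisions a NEW RayCluster for the job;
  # KubeRay does not attempt to recover the failed cluster's GCS state.
  backoffLimit: 2
  rayClusterSpec:
    rayVersion: "2.47.1"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: <ray-image>   # placeholder
```

Because every retry is a fresh cluster, any GCS state checkpointed for fault tolerance is never read back, which is the core of the argument above.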

In the long-lived RayCluster scenario, if GCS fails and FT is enabled, the same RayCluster restarts rather than a new one being created, so GCS FT does provide value there.
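For context, a hedged sketch of what enabling GCS FT on a long-lived RayCluster looks like in recent KubeRay releases (the `gcsFaultToleranceOptions` field and the Redis address follow the upstream v1 API; all values are placeholders):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: long-lived-cluster
spec:
  # GCS state is checkpointed to an external Redis, so a restarted head
  # pod can recover the SAME cluster instead of starting from scratch.
  gcsFaultToleranceOptions:
    redisAddress: redis.default.svc.cluster.local:6379   # placeholder
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: <ray-image>   # placeholder
```

This is the case the follow-up PR mentioned below is meant to address.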

I will include a second PR to fix GCS FT for long-lived RayClusters.

Note: We could propose an upstream issue to get GCS FT working for this case. It would require the KubeRay Operator to recognise that GCS FT is enabled for the lifecycled RayCluster and restart that cluster instead of spinning up a new one. As far as I could see, this logic doesn't exist yet, so we could open an issue for it.

Verification steps

N/A

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Aug 27, 2025

@kryanbeane: This pull request references RHOAIENG-30720 which is a valid jira issue.

In response to this:

Issue link

RHOAIENG-30720

What changes have been made

Removed GCS FT from the lifecycled RayJob implementation. I will discuss this with the team, but my reasoning is as follows:

If GCS FT was enabled and working with RayJobs as the entrypoint:

  • GCS fails for some reason
  • The RayCluster is no longer in the RUNNING state, so the RayJob is set to FAILED
  • This tells the KubeRay Operator to delete the RayCluster

The only case where a RayJob will retry is if backoffLimit is set in the job spec. That field tells KubeRay how many times to spin up a new RayCluster for the job; it does not try to recover the same RayCluster, which makes GCS FT useless in this scenario.

In the long-lived RayCluster scenario, if GCS fails and FT is enabled, the same RayCluster restarts rather than a new one being created, so GCS FT does provide value there.

I will include a second PR to fix GCS FT for long-lived RayClusters.

Verification steps

N/A

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from dimakis August 27, 2025 17:59
@codecov

codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.48%. Comparing base (cb5589c) to head (1f2241e).
⚠️ Report is 1 commit behind head on ray-jobs-feature.

Additional details and impacted files
@@                 Coverage Diff                  @@
##           ray-jobs-feature     #892      +/-   ##
====================================================
+ Coverage             93.45%   93.48%   +0.03%     
====================================================
  Files                    21       21              
  Lines                  1910     1889      -21     
====================================================
- Hits                   1785     1766      -19     
+ Misses                  125      123       -2     



@kryanbeane kryanbeane force-pushed the head-pod-persistance-fix branch 6 times, most recently from b105e4c to fd045a5 Compare August 28, 2025 16:47
RAY_VERSION = "2.47.1"
# Below references ray:2.47.1-py311-cu121
CUDA_RUNTIME_IMAGE = "quay.io/modh/ray@sha256:6d076aeb38ab3c34a6a2ef0f58dc667089aa15826fa08a73273c629333e12f1e"
MOUNT_PATH = "/home/ray/scripts"
Contributor Author


While I'm here, I'm also moving the new mount path to a constant, since it's referenced in so many places.
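As a hypothetical illustration of how a module-level `MOUNT_PATH` constant keeps the many call sites consistent (the helper names below are invented for this sketch, not taken from the PR):

```python
# Single source of truth for where job scripts are mounted in the Ray pods.
MOUNT_PATH = "/home/ray/scripts"


def script_volume_mount(name: str = "job-scripts") -> dict:
    """Build a Kubernetes volumeMount entry pointing at the shared script path."""
    return {"name": name, "mountPath": MOUNT_PATH}


def entrypoint_for(script: str) -> str:
    """Compose a Ray job entrypoint command from the mounted script location."""
    return f"python {MOUNT_PATH}/{script}"


print(entrypoint_for("train.py"))  # python /home/ray/scripts/train.py
```

If the mount path ever changes, only the constant needs updating rather than every place a volume mount or entrypoint is built.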

@kryanbeane kryanbeane force-pushed the head-pod-persistance-fix branch from 2c23337 to 1f2241e Compare September 1, 2025 10:37
Contributor

@LilyLinh LilyLinh left a comment


lgtm. Great work! Thanks Bryan! :)
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 1, 2025
@pawelpaszki
Contributor

looks good to me. ran a sample test against ROSA cluster successfully. Feel free to merge

@openshift-ci
Contributor

openshift-ci bot commented Sep 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LilyLinh, pawelpaszki

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 416ba8d into project-codeflare:ray-jobs-feature Sep 3, 2025
10 checks passed